Individual participant time-to-event data from multiple prospective epidemiologic studies enable detailed
investigation into the predictive ability of risk models. Here we address the challenges in appropriately combining such
information across studies. Methods are exemplified by analyses of log C-reactive protein and conventional risk 
factors for coronary heart disease in the Emerging Risk Factors Collaboration, a collation of individual data from 
multiple prospective studies with an average follow-up duration of 9.8 years (dates varied). We derive risk prediction 
models using Cox proportional hazards regression analysis stratified by study and obtain estimates of risk discrim- 
ination, Harrell's concordance index, and Royston's discrimination measure within each study; we then combine the 
estimates across studies using a weighted meta-analysis. Various weighting approaches are compared and lead us 
to recommend using the number of events in each study. We also discuss the calculation of measures of reclassi- 
fication for multiple studies. We further show that comparison of differences in predictive ability across subgroups 
should be based only on within-study information and that combining measures of risk discrimination from case- 
control studies and prospective studies is problematic. The concordance index and discrimination measure gave 
qualitatively similar results throughout. While the concordance index was very heterogeneous between studies, 
principally because of differing age ranges, the increments in the concordance index from adding log C-reactive 
protein to conventional risk factors were more homogeneous. 

C index; coronary heart disease; D measure; individual participant data; inverse variance; meta-analysis; risk 
prediction; weighting 



Abbreviations: CHD, coronary heart disease; C index, concordance index; CRP, C-reactive protein; D measure, discrimination 
measure; NRI, Net Reclassification Index. 



The derivation and assessment of risk prediction models 
using multiple epidemiologic studies has several advantages 
in comparison with analysis of single studies. These include 
greater precision, reduced overfitting, and increased general- 
izability (1, 2). Availability of individual participant data 
from several studies, as opposed to aggregate-level statistics, 
also allows detailed characterization of risk prediction mod- 
els and investigation of potential effect modifiers (3). We pre- 
viously described methods for investigating exposure-risk 
relationships using individual participant data from multiple 
studies (3) and proposed measures of discrimination for the 
stratified Cox model (4). In this paper, we extend and dem- 
onstrate methods for assessing risk prediction models using 
individual participant data from multiple studies based on 
weighted meta-analysis techniques. We describe assessment 
of the predictive ability of a risk prediction model, the change 
in predictive ability upon moving from one model to another, 
and the comparison of predictive abilities across different 
subgroups of the population. Such techniques for combining 
information across studies have not previously been de- 
scribed in detail in the literature and are of relevance to the 
growing number of collaborative consortia (5-13). 
We illustrate these methods using data from the Emerging
Risk Factors Collaboration, which comprises individual rec- 
ords from over 2.2 million participants in 125 prospective 
studies of major cardiovascular disease outcomes and cause- 
specific mortality in predominantly Western populations 
(14-17). The studies include mostly prospective cohort stud- 
ies, but also some nested case-control and nested case-cohort 
studies. Examples presented in this paper focus on prediction 
models for coronary heart disease (CHD), defined as first 
nonfatal myocardial infarction or coronary death, and exam- 
ine the predictive ability of C-reactive protein (CRP) concen- 
tration when added to conventional risk predictors. Data on 
CRP and conventional risk predictors at baseline were avail- 
able from 37 prospective studies involving 165,856 partici- 
pants without a history of cardiovascular disease, among 
whom 8,806 incident CHD events occurred over an average 
of 9.8 years of follow-up (for definitions of study names, see 
Web Table 1 (available at http://aje.oxfordjournals.org/)).



METHODS 

Derivation of a risk prediction model over multiple
studies

Initially we assume that all data are derived from prospec- 
tive cohort studies; other study designs are addressed later. 
Risk prediction models are constructed using Cox proportional
hazards models (18), stratified by study and, if applicable,
by other characteristics such as sex. For studies
$s = 1, \ldots, S$, with strata $k = 1, \ldots, K_s$ and individuals
$i = 1, \ldots, N_s$ with baseline risk factors
$x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, the probability of survival
beyond time t after baseline takes the form

$$S(t \mid x_i, s, k) = S_{0,s,k}(t)^{\exp(\beta x_i)} \qquad (1)$$

The evolution of risk over time is modeled differently for
each study, as represented by the nonparametric baseline survivor
function $S_{0,s,k}(t)$. The vector $\beta = (\beta_1, \beta_2, \ldots, \beta_p)$
represents the multivariable adjusted log hazard ratios, assumed
to be common to all studies, per unit increase in the risk predictors
$x_i$. An individual's estimated linear predictor, or risk
score, is simply $\hat{\beta} x_i = \sum_{j=1}^{p} \hat{\beta}_j x_{ij}$, and the person's absolute
risk of experiencing an event by time t is estimated by
$1 - \hat{S}_{0,s,k}(t)^{\exp(\hat{\beta} x_i)}$.

[Figure 1 diagram. Model Derivation: 2-stage approach, pooling study-specific $\hat{\beta}_s$; 1-stage approach, stratified model. Assessment of Predictive Ability: 2-stage approach, pooling study-specific $\hat{\theta}_s$; 1-stage approach, stratified calculation.]

Figure 1. Overall schemes for model derivation and testing of predictive ability over multiple studies. In the model derivation process, study-
specific data sets are used to estimate the pooled vector of coefficients ($\hat{\beta}$) for the included risk predictors, either by means of a 1-stage stratified
model or by a 2-stage approach applying meta-analysis of study-specific estimates. In assessment of predictive ability, the pooled $\hat{\beta}$ is used
to calculate the pooled discrimination statistic, either using a 1-stage stratified approach or by meta-analyzing study-specific estimates in a
2-stage approach. $w_s$ represents study-specific weights applied in meta-analysis approaches; possible choices are described in the text.

Fitting the stratified model (equation 1) is a 1-stage approach
to model derivation across studies (Figure 1). Alternatively,
a 2-stage approach could be undertaken: First, a
separate Cox proportional hazards model is fitted in each
study, and then its coefficients are combined over studies to
obtain $\hat{\beta}$ using either fixed- or random-effects (multivariate)
meta-analysis (3, 19, 20). A 2-stage random-effects meta-analysis
has the advantage of allowing for heterogeneity in
the true coefficients between studies, giving a larger variance
for $\hat{\beta}$. A multivariate meta-analysis combines estimates for
the vector of correlated coefficients, taking account of its covariance
matrix, over the multiple studies; separate univariate
meta-analyses ignore the correlations between the coefficients.
With the 2-stage approach, additional estimation is required
to obtain study-specific baseline survivor functions
necessary for making absolute risk predictions. The 1- and
2-stage approaches often give similar $\hat{\beta}$ estimates (21), and
hence risk scores, and the 1-stage model (equation 1) has the
advantage of simplicity.
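As an illustration of the 2-stage pooling of study-specific coefficients, the sketch below pools a single log hazard ratio across studies. It is a simplified univariate version (the text describes a multivariate meta-analysis), the random-effects variant uses the DerSimonian-Laird moment estimator as an assumed choice, and the function names and input values are hypothetical.

```python
import math

def pool_fixed(estimates, std_errors):
    """Inverse-variance fixed-effects pooled estimate and its standard error."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

def pool_random(estimates, std_errors):
    """Random-effects pooling (DerSimonian-Laird); needs at least 2 studies."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    fixed, _ = pool_fixed(estimates, std_errors)
    # Cochran's Q measures dispersion of study estimates around the fixed-effects mean.
    q = sum(w * (b - fixed) ** 2 for w, b in zip(weights, estimates))
    df = len(estimates) - 1
    c = total - sum(w ** 2 for w in weights) / total
    tau2 = max(0.0, (q - df) / c)  # between-study variance, truncated at zero
    re_weights = [1.0 / (se ** 2 + tau2) for se in std_errors]
    re_total = sum(re_weights)
    pooled = sum(w * b for w, b in zip(re_weights, estimates)) / re_total
    return pooled, math.sqrt(1.0 / re_total), tau2

# Hypothetical study-specific log hazard ratios and standard errors.
log_hrs = [0.25, 0.10, 0.32, 0.18]
ses = [0.05, 0.04, 0.10, 0.06]
fixed_est, fixed_se = pool_fixed(log_hrs, ses)
random_est, random_se, tau2 = pool_random(log_hrs, ses)
```

With heterogeneous inputs such as these, the random-effects standard error is larger than the fixed-effects one, which mirrors the larger variance noted in the text.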

The selection of risk predictors may depend on several fac- 
tors, including statistical significance, clinical importance, 
costs, and predictive ability. This paper focuses on the latter. 
The primary descriptions in this paper assume that the time 
scale used is duration of time in the study and that the propor- 
tional hazards assumption is met (3, 22). Assessment of pre- 
dictive ability given more complex model formulations is 
discussed later. 

Example: Deriving risk prediction models using data from
the Emerging Risk Factors Collaboration. Examples presented
in this paper are CHD risk models with conventional
risk predictors and log CRP. Deaths from other causes or
other nonfatal vascular outcomes (e.g., stroke) are regarded
as censored observations. There is considerable variation in
the censoring proportions (which equal 1 minus the event
proportions) across studies (Web Figure 1). Table 1 shows
summary statistics and $\hat{\beta}$ using each of the described approaches.
The 1-stage stratified model (equation 1) and the
2-stage fixed- and random-effects approaches all yield similar
values for $\hat{\beta}$; the standard errors for the random-effects
method are greater, reflecting between-study heterogeneity.
We checked the proportional hazards assumption by
assessing the interaction between log CRP and time in a
time-dependent Cox model using a 2-stage approach (i.e.,
study-specific interactions were first calculated and then combined
using random-effects meta-analysis (3)). There was no
evidence against the proportional hazards assumption (17),
and the 1-stage prediction model (equation 1) is used for all
further examples. The 2 corresponding risk scores (without
and with log CRP) are given in the footnotes of Table 1.
The linear predictors are approximately normally distributed
(Web Table 2).
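To make the risk score concrete, the following sketch evaluates the risk score with CRP from the Table 1 footnote and converts it to an absolute risk using the survival form of equation 1. The coefficients are taken from that footnote; the baseline survival value, the choice of a reference profile at which that baseline applies, and the function names are assumptions for illustration only, since study-specific baseline survivor functions are not reported here.

```python
import math

# Coefficients of the risk score with CRP, copied from the Table 1 footnote
# (log hazard ratios per 1-unit increase in each predictor).
COEFFICIENTS = {
    "age": 0.066, "smoker": 0.516, "sbp": 0.011, "diabetic": 0.557,
    "total_chol": 0.220, "hdl_chol": -0.652, "log_crp": 0.189,
}

def risk_score(person):
    """Linear predictor: sum of coefficient times predictor value."""
    return sum(COEFFICIENTS[name] * person[name] for name in COEFFICIENTS)

def absolute_risk(baseline_survival, score, reference_score=0.0):
    """1 - S0(t) ** exp(score - reference_score), following equation 1.

    baseline_survival is assumed to be the survival probability at time t
    for a person whose risk score equals reference_score.
    """
    return 1.0 - baseline_survival ** math.exp(score - reference_score)

# Hypothetical reference profile (predictor means from Table 1, nonsmoker, nondiabetic).
reference = {"age": 64.2, "smoker": 0, "sbp": 131, "diabetic": 0,
             "total_chol": 5.84, "hdl_chol": 1.27, "log_crp": 0.55}
smoker = dict(reference, smoker=1)

BASELINE_SURVIVAL_10Y = 0.95  # assumed value, not taken from the paper
ref_score = risk_score(reference)
risk_reference = absolute_risk(BASELINE_SURVIVAL_10Y, ref_score, ref_score)
risk_smoker = absolute_risk(BASELINE_SURVIVAL_10Y, risk_score(smoker), ref_score)
```

Because the baseline survivor function is stratum specific, two people with the same risk score can have different absolute risks in different studies, even though their ranking within a study is unchanged.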

Table 1. Characteristics of Study Participants in the Emerging Risk Factors Collaboration and Comparison of Log Hazard Ratios for Coronary
Heart Disease in Multivariable-Adjusted Models

| Characteristic | Mean (SD) | No. of Subjects | % | Log HR^a (SE), 1-Stage Stratified Model^b,c | Log HR^a (SE), 2-Stage Fixed-Effects Model^d | Log HR^a (SE), 2-Stage Random-Effects Model^d | Heterogeneity I², % | Heterogeneity 95% CI |
|---|---|---|---|---|---|---|---|---|
| Age at survey, years | 64.2 (8.6) | | | 0.567 (0.013) | 0.565 (0.013) | 0.529 (0.043) | 76 | 67, 82 |
| Male sex | | 81,732 | 49 | NA | NA | NA | NA | NA |
| Current smoking^e | | 35,577 | 21 | 0.516 (0.024) | 0.529 (0.024) | 0.515 (0.050) | 63 | 48, 73 |
| Systolic blood pressure, mm Hg | 131 (19) | | | 0.202 (0.009) | 0.203 (0.009) | 0.211 (0.017) | 30 | 0, 53 |
| History of diabetes^e | | 10,790 | 7 | 0.557 (0.038) | 0.587 (0.037) | 0.600 (0.049) | 24 | 0, 49 |
| Total cholesterol, mmol/L | 5.84 (1.06) | | | 0.234 (0.010) | 0.235 (0.010) | 0.216 (0.018) | 32 | 0, 54 |
| HDL cholesterol, mmol/L | 1.27 (0.38) | | | -0.247 (0.014) | -0.240 (0.014) | -0.232 (0.023) | 52 | 32, 67 |
| Log CRP, mg/L | 0.55 (1.09) | | | 0.206 (0.012) | 0.207 (0.012) | 0.201 (0.013) | 9 | 0, 38 |

Abbreviations: CI, confidence interval; CRP, C-reactive protein; HDL, high-density lipoprotein; HR, hazard ratio; NA, not applicable; SD, standard
deviation; SE, standard error.

a Log HR for coronary heart disease per 1-SD increase or in comparison with the relevant reference category, using data from 37 studies
(165,856 participants with 8,806 cases of coronary heart disease).

b Implies that log HRs were estimated using the stratified model described in equation 1.

c The 1-stage stratified model was used to construct the following risk scores (note that log HRs now represent a 1-unit increase in continuous risk
factors). Risk score without CRP: 0.068 x age + 0.576 x smoker + 0.012 x systolic blood pressure + 0.584 x diabetic + 0.221 x total cholesterol -
0.756 x HDL cholesterol; risk score with CRP: 0.066 x age + 0.516 x smoker + 0.011 x systolic blood pressure + 0.557 x diabetic + 0.220 x total
cholesterol - 0.652 x HDL cholesterol + 0.189 x log CRP.

d Implies that log HRs were estimated by meta-analyzing study-specific estimates assuming fixed or random effects, respectively.

e Reference categories were non-current smoker for smoking and nondiabetic for history of diabetes.

Measures of discrimination

Measures of discrimination quantify the degree to which a
model can predict the order of events. Two such measures are
the concordance index (C index) (22, 23) and the discrimination
measure (D measure) (24), which are pertinent because
of their relevant interpretation, familiarity for the intended
clinical and epidemiologic audience, and low sensitivity to
censoring in the absence of marked skewness of the linear
predictor. In a single unstratified study, the C index estimates
the probability of concordance between predicted risk and the
observed order of events for a randomly selected pair of participants
(22, 23). Only informative pairs (where it is possible
to determine which person suffered the first event) are used:
This introduces some sensitivity to censoring (25), which we
ignore because currently available solutions either assume
correct model specification (25) or do not accommodate censoring
by the end of follow-up (26). The D measure estimates
the mean log hazard ratio for the event of interest for a randomly
selected pair of participants (for one individual in the
top half of the predicted risk distribution versus another individual
in the bottom half) (24). The variance of the C index
can be calculated by bootstrapping or by means of a jackknife
procedure (27), whereas the variance of the D measure is simply
a log hazard ratio variance. When stratification (e.g., by
study and sex) is used, the selection of pairs is constrained
to be within the same stratum.
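The C index described above (concordance among informative pairs, with pairs restricted to the same stratum) can be sketched as follows. This is a minimal quadratic-time illustration, not the implementation used for the analyses; tied event times are simply treated as uninformative, tied risk scores count as half concordant, and the function names and data are hypothetical.

```python
from collections import defaultdict

def c_index(times, events, scores):
    """Harrell's C for one stratum; returns (C, number of informative pairs)."""
    concordant = 0.0
    informative = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # the earlier member of an informative pair must have had an event
        for j in range(n):
            if times[i] < times[j]:  # i had the event while j was still under follow-up
                informative += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    c = concordant / informative if informative else float("nan")
    return c, informative

def stratified_c_index(times, events, scores, strata):
    """Pool concordant and informative pairs over strata (pairs never cross strata)."""
    groups = defaultdict(list)
    for idx, stratum in enumerate(strata):
        groups[stratum].append(idx)
    concordant = 0.0
    informative = 0
    for members in groups.values():
        c, pairs = c_index([times[k] for k in members],
                           [events[k] for k in members],
                           [scores[k] for k in members])
        if pairs:
            concordant += c * pairs
            informative += pairs
    return concordant / informative

# Hypothetical data: ranking is perfect within each stratum but not across strata.
times = [1.0, 2.0, 1.5, 3.0]
events = [1, 1, 1, 0]
scores = [10.0, 9.0, 1.0, 0.0]
strata = ["A", "A", "B", "B"]
unstratified, n_pairs = c_index(times, events, scores)
stratified = stratified_c_index(times, events, scores, strata)
```

In this toy example the unstratified value is below 1 only because of a discordant pair formed across strata, which the stratified calculation excludes.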



Calculation of measures of discrimination using 
multiple studies 

A 2-stage approach can be used to estimate discrimination
measures over multiple studies (Figure 1). Firstly, the discrimination
measure is estimated within each study s, denoted
by $\hat{\theta}_s$, with corresponding standard error (SE) $\hat{\sigma}_s$. The study-
specific estimates are then combined using a weighted average
to obtain the pooled estimate $\hat{\theta}$:

$$\hat{\theta} = \frac{\sum_s w_s \hat{\theta}_s}{\sum_s w_s} \qquad (2)$$

$$\text{with SE } \hat{\sigma}_{\theta} = \frac{\sqrt{\sum_s w_s^2 \hat{\sigma}_s^2}}{\sum_s w_s} \qquad (3)$$

where $w_s$ is the weight applied to each study's estimate. The
estimated $\hat{\theta}$ represents the discrimination measure when pairs
of individuals are selected via a 2-stage sampling scheme:
First one of the S studies is selected with probability proportional
to $w_s$, and then a pair is selected at random from that
study. We choose $w_s$ as the study-specific number of events,
since this is the principal determinant of study precision.
Since this weighting scheme is concerned with sampling
from existing studies only, it is not relevant to allow for heterogeneity
in $\hat{\theta}_s$ across studies.
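The weighted average and its standard error (equations 2 and 3) amount to a few lines of arithmetic. The sketch below uses hypothetical study-specific C indices, standard errors, and event counts; the function name is illustrative.

```python
import math

def pool_weighted(estimates, std_errors, weights):
    """Weighted average of study estimates and its SE, treating weights as fixed."""
    total = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total
    pooled_se = math.sqrt(sum((w * se) ** 2
                              for w, se in zip(weights, std_errors))) / total
    return pooled, pooled_se

# Hypothetical study-specific C indices, standard errors, and numbers of events.
c_indices = [0.70, 0.65, 0.74]
ses = [0.02, 0.01, 0.03]
events = [100, 400, 50]

pooled_c, pooled_se = pool_weighted(c_indices, ses, events)

# Inverse-variance weights are one of the alternatives discussed in the text.
iv_weights = [1.0 / se ** 2 for se in ses]
pooled_c_iv, pooled_se_iv = pool_weighted(c_indices, ses, iv_weights)
```

With inverse-variance weights, the same formula reduces to the familiar fixed-effects standard error, the reciprocal square root of the summed weights.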

Possible alternative weights include inverse-variance 
weights from a fixed- or random-effects meta-analysis. The 
latter results in a C index which estimates the probability of 
concordance for a randomly selected pair of participants from 
a new study, sampled from a distribution from which the ex- 
isting studies are believed to have come. 

Alternatively, 1-stage stratified calculations of the discrimination
measure could be used, stratifying by study (4)
(Figure 1). For the stratified D measure, this gives study
weights similar to the number of events. For the stratified C 
index, however, studies receive weights according to the 
number of contributing informative pairs, which generally 
depends on the total number of study participants. As a result, 
large studies with few events can receive substantial weight, 
which may be unappealing. 

The impact of heterogeneity on the imprecision of the
pooled estimate of discrimination can be quantified by the
$I^2$ statistic, defined as the percentage of variance in the study-
specific point estimates that is attributable to true between-
study heterogeneity as opposed to sampling variation (28).
Values of $I^2$ close to 0% correspond to lack of heterogeneity,
and values close to 100% correspond to heterogeneity much
larger than the sampling variation. The primary determinants
of heterogeneity in study-specific estimates of discrimination
$\hat{\theta}_s$ are: 1) study-specific distributions of the risk predictors,
with wider ranges of continuous risk predictors leading to
higher values of $\hat{\theta}_s$, and 2) variation in the relevance of the
pooled $\hat{\beta}$ to individual studies.
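A common way to compute this heterogeneity percentage is from Cochran's Q statistic, as in the sketch below. The formula used (100 x (Q - df) / Q, truncated at zero) is the standard one and is assumed to correspond to the statistic cited in the text; the function name and input values are hypothetical.

```python
def i_squared(estimates, std_errors):
    """Percentage of variation in study estimates attributed to heterogeneity."""
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    # Cochran's Q: weighted squared deviations from the fixed-effects mean.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    if q <= 0:
        return 0.0
    return max(0.0, 100.0 * (q - df) / q)

# Hypothetical C indices: precise estimates that differ widely between studies.
heterogeneous = i_squared([0.60, 0.70, 0.80], [0.005, 0.005, 0.005])
# Identical estimates show no heterogeneity.
homogeneous = i_squared([0.70, 0.70, 0.70], [0.02, 0.01, 0.03])
```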



Example: Calculation of C Index and D measure. Web 
Figure 1 illustrates study-specific C indices for a conventional 
CHD prediction model (including all predictors in Table 1 ex- 
cept log CRP), with pooled estimates derived using various 
weighting schemes. The second and third columns show that 
the proportion of events varies from 0.5% to 20%. Weighting 
by the number of informative pairs gives inappropriate weight- 
ing across studies, with 2 large studies (the Reykjavik Study 
and the Women's Health Study) receiving 57% of the com- 
bined weight. When weighting is done by number of events 
or by inverse variance assuming fixed effects, large studies 
with few events are assigned comparatively less weight, but 
the contribution of studies with many events, such as the Reyk- 
javik Study, remains substantial. In contrast, weights assuming 
random effects are more uniformly assigned across studies. 
This is expected in the presence of large between-study hetero- 
geneity, as is the noticeably wider 95% confidence interval for 
the pooled estimate. This latter approach allows calculation of 
a 95% prediction interval (29) to indicate the range of values 
that might be expected in a new study when there is between- 
study heterogeneity (Web Figure 1). 

Similar results are seen with the D measure (Web Figure 2),
and there is strong correlation between the study-specific C
index and D measure (Web Figure 3). Heterogeneity in study-
specific absolute values for both measures is substantial ($I^2$ =
93% for the C index and $I^2$ = 91% for the D measure). Meta-
regression (21, 30) reveals strong positive correlations
between study-specific $\hat{\theta}_s$ (C index or D measure) and the
standard deviation of age (Figure 2, top panels). After the
meta-regression adjustment, the $I^2$ value for the C index is
reduced to 75%, remaining substantial.

Meta-regression also reveals correlations between study-
specific $\hat{\theta}_s$ and the standard deviation of the prognostic
index and, to a lesser extent, its skewness and kurtosis
(Web Figure 4, top panels), probably due to the extensive
censoring. Calculations are stratified by sex, preventing sex
from contributing to calculation of $\hat{\theta}_s$ and eliminating
between-study heterogeneity caused by different proportions
of males and females. $I^2$ values in Table 1 indicate moderate
heterogeneity in study-specific $\hat{\beta}_s$ for some predictors (particularly
age and smoking), which may also explain some
of the heterogeneity in $\hat{\theta}_s$.

Calculating a change in discrimination using 
multiple studies 

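Often interest is in the difference in $\theta$ between 2 alternative
models, denoted $\Delta = \theta_{\text{model 2}} - \theta_{\text{model 1}}$. As before, we combine
study-specific differences in the C index or D measure,
$\hat{\Delta}_s = \hat{\theta}_{s,\text{model 2}} - \hat{\theta}_{s,\text{model 1}}$, with variances $\hat{\sigma}_{\Delta_s}^2$:

$$\hat{\Delta} = \frac{\sum_s w_s \hat{\Delta}_s}{\sum_s w_s} \qquad (4)$$

$$\text{with SE } \hat{\sigma}_{\Delta} = \frac{\sqrt{\sum_s w_s^2 \hat{\sigma}_{\Delta_s}^2}}{\sum_s w_s} \qquad (5)$$

where $\hat{\sigma}_{\Delta_s}^2$ for the C index difference is directly estimable
using the jackknife procedure (27) and for the D measure
difference is obtained using nonparametric bootstrapping.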
Figure 2. Meta-regression of study-specific concordance index (C index) and discrimination measure (D measure) for model 1, and subsequent
changes upon addition of log C-reactive protein, on the study-specific standard deviation (SD) of age. Model 1 included conventional risk factors: 
age, smoking status, systolic blood pressure, history of diabetes, total cholesterol, and high-density lipoprotein cholesterol, and results are stratified 
by sex. The size of each circle represents the inverse variance weight applied to each study in the meta-regression. 



When pooling study-specific changes, weighting by the
number of events is attractive for the same reasons as those
discussed above for pooling absolute values. This scheme
also ensures consistency between the difference in the pooled
model-specific $\hat{\theta}$ estimates ($\hat{\theta}_{\text{model 2}} - \hat{\theta}_{\text{model 1}}$) and the result
obtained by pooling the within-study differences $\hat{\Delta}_s$. Such
consistency is not true of inverse-variance weighting approaches,
which may give a pooled estimate close to the
null, since small values of $\hat{\Delta}_s$ tend to have small variances.
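The consistency property of event weighting can be checked numerically. In the sketch below, all study-specific values are hypothetical; the point is only that a common set of weights makes "pooled difference" and "difference of pooled values" coincide, whereas inverse-variance weights differ between the three quantities being pooled.

```python
def weighted_mean(values, weights):
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def inverse_variance(std_errors):
    return [1.0 / se ** 2 for se in std_errors]

# Hypothetical study-specific C indices under two models, and event counts.
c_model1 = [0.700, 0.650, 0.740]
c_model2 = [0.710, 0.670, 0.745]
events = [100, 400, 50]
diffs = [b - a for a, b in zip(c_model1, c_model2)]

# Event weights: the same weights apply to every quantity, so results agree.
pooled_diff = weighted_mean(diffs, events)
diff_of_pooled = weighted_mean(c_model2, events) - weighted_mean(c_model1, events)

# Inverse-variance weights: each quantity has its own (hypothetical) SEs.
se_model1 = [0.020, 0.010, 0.030]
se_model2 = [0.020, 0.011, 0.030]
se_diff = [0.002, 0.004, 0.001]  # smaller differences have smaller SEs here
pooled_diff_iv = weighted_mean(diffs, inverse_variance(se_diff))
diff_of_pooled_iv = (weighted_mean(c_model2, inverse_variance(se_model2))
                     - weighted_mean(c_model1, inverse_variance(se_model1)))
```

In this example the inverse-variance pooled difference is also pulled toward zero, because the smallest study-specific differences carry the largest weights.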

When model 2 extends model 1 through added predictors,
$\hat{\Delta}_s$ represents the incremental predictive ability of the added
predictors, and heterogeneity in $\hat{\Delta}_s$ depends on 1) the study-
specific distributions of the added predictors (wider ranges
leading to greater $\hat{\Delta}_s$) and 2) the relevance of the pooled $\hat{\beta}$
for the added predictors to individual studies. In addition,
since the C index has an upper bound of 1, improvements
in the C index are more difficult to achieve for higher starting
values. Given heterogeneity in study-specific C indices for
model 1, we might expect consequent heterogeneity in $\hat{\Delta}_s$.
Since the D measure is a log hazard ratio, this potential "ceiling
effect" does not apply.

Example: Calculating a change in discrimination. For our
example, model 1 contains conventional CHD risk predictors
nested within model 2, which additionally contains log CRP
(Table 1). There are significant increases in the C index and D
measure upon the addition of log CRP under all weighting
schemes (Web Figures 5 and 6). The clinical interpretation of
this is discussed in detail elsewhere (17). Heterogeneity in $\hat{\Delta}_s$ is
less than that for absolute values $\hat{\theta}_s$ ($I^2$ = 0% and $I^2$ = 26% for
C-index and D-measure changes, respectively). This lack of
heterogeneity can be attributed to similarity in the distribution
of log CRP across studies and to homogeneity in $\hat{\beta}_s$ for this predictor
($I^2$ = 9%). $\hat{\Delta}_s$ is also independent of study-specific age
range (Figure 2, bottom panels), as well as the standard deviation,
skewness, and kurtosis of the prognostic index (Web Figure
4, bottom panels), and we see little impact of the ceiling
effect with the C index. The correlation between within-study
changes in the C index and D measure is strong (Web Figure 3).

Subgroup-specific measures of discrimination 

Of possible interest are subgroup-specific changes in discrimination,
$\Delta_m = \theta_{m,\text{model 2}} - \theta_{m,\text{model 1}}$ for $m = 1, \ldots, M$
subgroups, upon addition of a new predictor. These can be
estimated as follows: 1) using equation 1, fit model 1 (without
the new predictor) and model 2 (with the new predictor and its
interaction with the subgroup variable); 2) calculate study- and
subgroup-specific discrimination measures for each model as
$\hat{\theta}_{s,m,\text{model 1}}$ and $\hat{\theta}_{s,m,\text{model 2}}$ and their difference, $\hat{\Delta}_{s,m}$; and
3) pool $\hat{\Delta}_{s,m}$ and their variance estimates across studies as
in equations 4 and 5 using study weights equal to the number
of events within the subgroup, to obtain subgroup-specific
estimates $\hat{\Delta}_m$ and corresponding standard errors $\hat{\sigma}_{\Delta_m}$. The
null hypothesis that $\Delta_m$ is the same across all subgroups can
be tested using a $\chi^2$ test with M - 1 degrees of freedom. To
maintain within-study comparisons between subgroups, only
studies with data on all subgroup levels are used (e.g., only
studies with both men and women are used to compare sex-
specific subgroups). This avoids erroneous conclusions resulting
from between-study comparisons (3).
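One way to carry out the test of equality across subgroups is a Cochran-type heterogeneity statistic on the pooled subgroup estimates, sketched below. The exact form of the test statistic is an assumption (the text specifies only the degrees of freedom), the function names are hypothetical, and the subgroup estimates and standard errors are invented for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper tail probability of a chi-squared variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^j / j!.
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite correction sum.
    total = math.erfc(math.sqrt(half))
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (j - 0.5) / math.gamma(j + 0.5)
    return total

def subgroup_heterogeneity(estimates, std_errors):
    """Return (Q, df, P value) for the hypothesis of equal subgroup estimates."""
    weights = [1.0 / se ** 2 for se in std_errors]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, estimates))
    df = len(estimates) - 1
    return q, df, chi2_sf(q, df)

# Hypothetical pooled C index changes and SEs in two subgroups (e.g., men and women).
q, df, p_value = subgroup_heterogeneity([0.0060, 0.0010], [0.0010, 0.0008])
```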

Example: Calculating measures of discrimination across 
subgroups. Figure 3 shows that log CRP appears to provide 
less improvement in discrimination among women and non- 
smokers, although these results will require confirmation 
elsewhere before being adopted into clinical guidelines (17). 

Other issues 

Time-dependent risk predictions. Measures of discrimination
assess how well participants are ranked in terms of
risk predictions. In a proportional hazards model, where duration
of time in the study is employed as the time scale and
baseline covariates are used, ranking of t-year risk does not
change with t; hence, measures of discrimination are stable
over time and $\hat{\beta} x_i$ is sufficient for their calculation. If time-
dependent covariates are introduced or if nonproportional
hazards are modeled, the ranking of t-year risk can change
with t. In such situations, we suggest calculating the predicted
risk estimates for each individual for a single (or a selection
of) fixed time point(s) t. Each set of t-year risk predictions
can be used to rank individuals and calculate t-year-specific
measures of discrimination using the methods previously described,
but with censoring of follow-up at the selected t so
that only the order of events occurring before t is considered.
Since measures of discrimination will now change with time,
careful choice of t is required. If a selection of fixed points t is
used, then it may be useful to plot the t-year-specific measures
against t.
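The censoring step for t-year-specific measures can be sketched as a small helper that truncates follow-up at the chosen horizon. The function name is hypothetical, and treating an event exactly at the horizon as observed is an assumed convention.

```python
def censor_at(times, events, horizon):
    """Truncate follow-up at `horizon`; later events become censored observations."""
    censored_times = [min(t, horizon) for t in times]
    censored_events = [bool(e) and t <= horizon for t, e in zip(times, events)]
    return censored_times, censored_events

# Hypothetical follow-up times (years) and event indicators, censored at 10 years.
times_10y, events_10y = censor_at([3.0, 12.0, 10.0, 7.0], [1, 1, 1, 0], 10.0)
```

The truncated times and indicators, together with the t-year risk predictions used as the ranking variable, can then be passed to whichever discrimination measure is being calculated.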

Similar considerations are relevant when using age as the
time scale (31-33); in this case, participant entry into the
model is staggered (with entry at starting age) and, again,
ranking of t-year risk changes with choice of t. Here, risk
from "age-at-entry" to "age-at-entry plus t years" can be used
to rank participants in calculation of the discrimination measure.
With this approach, follow-up should be censored at
"age-at-entry plus t years." A simpler approach is possible
for the D measure, in which the original algorithm is used
but with age as the time scale in the Cox models. This will
yield lower D measures, since age is effectively adjusted
for in the discrimination calculation.

[Figure 3 forest plots: A) C index change (95% CI); B) D measure change (95% CI). The tabulated columns of the figure are reproduced below.]

| Variable or Subgroup | No. of Studies | No. of Participants | CHD Cases | P Value for Heterogeneity, C Index Change | P Value for Heterogeneity, D Measure Change |
|---|---|---|---|---|---|
| Sex | | | | <0.0001 | <0.0001 |
| Male | 24 | 53,037 | 4,189 | | |
| Female | 24 | 58,199 | 2,515 | | |
| Smoking status | | | | 0.004 | 0.004 |
| Other | 36 | 128,676 | 5,403 | | |
| Current | 36 | 35,507 | 3,380 | | |
| History of diabetes | | | | 0.752 | 0.173 |
| No | 36 | 144,353 | 7,491 | | |
| Yes | 36 | 10,788 | 1,004 | | |
| Framingham 2008 10-year CVD risk | | | | 0.027 | 0.147 |
| <10% | 30 | 62,715 | 932 | | |
| 10%-<20% | 30 | 47,161 | 2,527 | | |
| ≥20% | 30 | 47,069 | 4,808 | | |
| Overall | 37 | 165,856 | 8,806 | NA | NA |

Figure 3. Changes in the concordance index (C index) (section A) and the discrimination measure (D measure) (section B) upon movement from
model 1 to model 2 within various population subgroups. Model 1 included conventional risk factors: age, smoking status, systolic blood pressure,
history of diabetes, total cholesterol, and high-density lipoprotein cholesterol, and results are stratified by sex. Model 2 additionally included log
C-reactive protein and an interaction term for interaction between this predictor and each subgroup factor. Bars, 95% confidence intervals (CIs).
CHD, coronary heart disease; CVD, cardiovascular disease; NA, not applicable.

Case-control studies. It is possible to estimate predictive 
ability in case-control studies using the area under the re- 
ceiver operating characteristic curve (22). However, matched 
case-control studies create 2 problems: 1) coefficients for 
matched variables are essentially meaningless (or null), lead- 
ing to distortion of risk predictions, and 2) the restricted dis- 
tribution of matched variables means that discrimination 
appears reduced. Hence, as has been reported previously 
(34, 35), values of discrimination obtained are commonly in- 
consistent with those from cohort studies. 

Example: C statistic for nested case-control studies. Figure 
4 illustrates that the C statistic is substantially lower in nested 
age-matched case-control studies than in cohort studies, and 
the corresponding change upon addition of log CRP is 
greater. 

Measures of reclassification 

Measures of reclassification quantify the extent to which
individuals are more appropriately classified into risk categories
using a new model versus an old model. Participants are
placed into predefined risk categories based on their predicted
absolute risk of experiencing an event by time t according
to each model. Reclassification can be quantified
using the Net Reclassification Index (NRI) (36), which is
the sum of 2 net proportions: 1) the net proportion of events by
time t that move up through the risk categories upon using
the new model and 2) the net proportion of nonevents at time t
that move down through the risk categories upon using the
new model. We suggest reporting these 2 meaningful proportions
(as an "event NRI" and "nonevent NRI," respectively)
along with an overall NRI. Participants censored before t
years are excluded from these calculations.
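A categorical NRI calculation along these lines can be sketched as follows. The cutoffs correspond to the 10% and 20% risk thresholds used in the examples of this paper; the function names and the predicted risks are hypothetical, and the inputs are assumed to contain only participants with known event status at time t (that is, those censored earlier have already been excluded).

```python
from bisect import bisect_right

def risk_category(risk, cutoffs=(0.10, 0.20)):
    """0 for risk below the first cutoff, 1 for the middle band, 2 for the top band."""
    return bisect_right(cutoffs, risk)

def net_reclassification(old_risks, new_risks, had_event, cutoffs=(0.10, 0.20)):
    """Return (event NRI, nonevent NRI, overall NRI)."""
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for old, new, event in zip(old_risks, new_risks, had_event):
        move = risk_category(new, cutoffs) - risk_category(old, cutoffs)
        if event:
            n_e += 1
            up_e += move > 0
            down_e += move < 0
        else:
            n_n += 1
            up_n += move > 0
            down_n += move < 0
    event_nri = (up_e - down_e) / n_e        # net proportion of events moving up
    nonevent_nri = (down_n - up_n) / n_n     # net proportion of nonevents moving down
    return event_nri, nonevent_nri, event_nri + nonevent_nri

# Hypothetical predicted t-year risks under the old and new models.
old = [0.08, 0.15, 0.25, 0.05, 0.12, 0.18]
new = [0.12, 0.22, 0.18, 0.04, 0.08, 0.21]
event = [True, True, True, False, False, False]
event_nri, nonevent_nri, overall_nri = net_reclassification(old, new, event)
```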

[Figure 4 plot: panels A) and B); horizontal axis, C statistic (95% CI); rows grouped by study design.]

Figure 4. Comparison of C statistics for the cohort and case-control study designs. The concordance index (C index) is shown for cohort studies,

For multiple studies, and having derived the prediction
model using only studies with at least t years of follow-up,
a 1-stage approach for the calculation of the NRI across multiple
studies can be applied by calculating the 2 proportions
across all studies. A 2-stage approach could also be taken, in
which the NRI is calculated within each study before pooling.
However, study-specific estimates of the NRI can be unstable
if few participants experience an event and, hence, very few
events change categories. It is also unclear which weights to
apply; while weighting by the number of events is intuitively
sensible for measures of discrimination and for the "event
NRI" component, it is less relevant for the "nonevent NRI."
Weighting the "event NRI" and the "nonevent NRI" by the
number of events and nonevents, respectively, is equivalent
to the 1-stage approach. Inverse-variance weighting produces
results closer to the null, since studies with few movements
between risk categories will have small standard errors and
receive greater weight. The most appropriate weighting
scheme to use may depend on the similarity of studies in
terms of risk distribution, which affects the proximity of participant
risk predictions to the risk category boundaries and
hence the degree of movement between categories. Ideally
the risk distribution within each study should match that of
the target population, which would make inverse-variance
weighting more clinically relevant. In the absence of data
of this type, reweighting calculations to mimic movement
that would be expected in a standard target population (e.g.,
a standard European population) is a possibility.
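The equivalence between the 1-stage calculation and a 2-stage calculation weighted by the number of events can be verified with simple arithmetic for the "event NRI" component. The study-level counts below are hypothetical; the same argument applies to the "nonevent NRI" with weights equal to the number of nonevents.

```python
# Hypothetical per-study counts among events:
# (events moving up, events moving down, total number of events).
studies = [(12, 5, 100), (30, 20, 400), (3, 1, 50)]

# 1-stage: net movement among events, computed across all studies at once.
one_stage = (sum(up - down for up, down, _ in studies)
             / sum(n for _, _, n in studies))

# 2-stage: study-specific event NRIs pooled with weights equal to the number of events.
study_nris = [(up - down) / n for up, down, n in studies]
weights = [n for _, _, n in studies]
two_stage = sum(w * nri for w, nri in zip(weights, study_nris)) / sum(weights)
```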

Calculation of the NRI is less feasible for case-control 
studies, since it is not possible to directly derive absolute 
risk estimates from these studies. Matching (particularly by 
age) also affects the proportions of participants placed into 
each category such that they are not representative of the tar- 
get population. 



Example: Calculation of the NRI. Figures 5 and 6 illustrate
different weighting schemes in the estimation of the
NRI for models with and without log CRP using conventional
10-year risk categories (<10%, 10% to <20%, and
≥20%). Studies with fewer than 10 years of follow-up and
participants censored before 10 years are excluded, and thus
the NRI estimates are not directly comparable with the C
index or D measure. The power of the NRI is also lower,
since it relies on only a few categories of predicted risk.
Inverse-variance weighting (particularly fixed effects) gives
large weight to 1 low-risk study (the Women's Health Study)
in which the majority of participants (all female) are
nonevents in the lowest risk category. Since there is little
movement between risk categories, the study's NRI estimate has a
small standard error and receives a large inverse-variance
weight. Our current recommendation is to use the 1-stage
calculation.
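
The contrast between the weighting schemes can be illustrated with a minimal sketch. It uses nonevent NRI estimates, standard errors, and 10-year event counts transcribed from Figure 5 for 5 of the cohorts only, so it does not reproduce the pooled values in the figure; the standard errors are rounded to 2 decimal places as printed, and the helper names (`normalised`, `pooled`) are illustrative assumptions rather than part of the authors' Stata code.

```python
# Values transcribed from Figure 5 (nonevent NRI) for an illustrative
# subset of cohorts: (NRI estimate, standard error, CHD events in 10 years).
studies = {
    "WHS":     (0.00, 0.01, 234),
    "EPICNOR": (-0.18, 0.10, 330),
    "COPEN":   (0.60, 0.22, 354),
    "CHS":     (0.76, 0.70, 489),
    "REYK":    (-0.65, 0.15, 666),
}


def normalised(raw):
    """Rescale raw weights so that they sum to 1."""
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def pooled(weights):
    """Weighted average of the study-specific NRI estimates."""
    return sum(weights[name] * studies[name][0] for name in studies)


# Fixed-effect inverse-variance weights: 1 / SE^2.
iv_fe = normalised({name: 1.0 / se ** 2 for name, (_, se, _) in studies.items()})
# Weights proportional to the number of events in each study.
by_events = normalised({name: float(ev) for name, (_, _, ev) in studies.items()})

for name in studies:
    print(f"{name:8s} IV-FE {100 * iv_fe[name]:6.2f}%   events {100 * by_events[name]:6.2f}%")
print("pooled, IV-FE weights :", round(pooled(iv_fe), 3))
print("pooled, event weights :", round(pooled(by_events), 3))
```

In this subset the Women's Health Study dominates the inverse-variance weights because of its very small standard error, whereas its share of the event-based weights is modest.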



Cohort      Overall    CHD Cases     NRI  (95% CI)         SE of       %WT      %WT      %WT
                No.  in 10 Years                             NRI  Event 10    IV-FE    IV-RE
MOSWEGOT        359            5   -0.54  (-1.59, 0.52)     0.54      0.12     0.03     2.85
FINRISK97     1,150           56    0.00  (-1.18, 1.18)     0.60      1.38     0.03     2.36
EAS             741           48   -0.70  (-2.46, 1.05)     0.89      1.19     0.01     1.16
BRUM            817           26   -0.29  (-1.29, 0.70)     0.51      0.64     0.04     3.11
HISAYAMA      2,577           60    0.10  (-0.38, 0.57)     0.24      1.48     0.16     8.26
MOGERAUG1       873           59   -0.14  (-2.01, 1.74)     0.96      1.46     0.01     1.02
QUEBEC        1,219           67   -0.39  (-1.70, 0.93)     0.67      1.66     0.02     1.95
FINRISK92       891           99    0.17  (-2.32, 2.66)     1.27      2.45     0.01     0.59
ULSAM           926          103    2.28  (-0.95, 5.51)     1.65      2.55     0.00     0.36
RANCHO        1,381          116   -1.56  (-3.22, 0.11)     0.85      2.87     0.01     1.27
ROTT          4,437          184    0.09  (-0.42, 0.60)     0.26      4.55     0.14     7.71
WHS          23,287          234    0.00  (-0.02, 0.02)     0.01      5.79    97.13    16.22
SHS           3,112          280    0.73  (-2.40, 3.87)     1.60      6.92     0.00     0.38
USPHS2       10,715          305   -0.47  (-0.69, -0.25)    0.11      7.54     0.74    13.47
EPICNOR      15,902          330   -0.18  (-0.37, 0.01)     0.10      8.16     1.02    14.13
ARIC          9,326          374   -0.12  (-2.11, 1.87)     1.01      9.25     0.01     0.92
KIHD          2,020          189   -0.30  (-1.80, 1.20)     0.77      4.67     0.02     1.54
COPEN         7,772          354    0.60  (0.16, 1.03)      0.22      8.75     0.19     8.96
CHS           4,211          489    0.76  (-0.62, 2.15)     0.70     12.09     0.02     1.79
REYK         14,927          666   -0.65  (-0.94, -0.35)    0.15     16.47     0.43    11.96
HOORN           525           19    0.00  (NA, NA)          0.00      0.47     0.00     0.00
Total                                                                  100      100      100

Overall (I² = 95%, P < 0.0001)
Overall no. of events in 10 years WT      0.02  (-0.35, 0.38)
Overall IV fixed-effect WT               -0.01  (-0.03, 0.01)
Overall IV random-effect WT              -0.14  (-0.33, 0.06)
Overall 1-stage calculation              -0.14  (-0.27, -0.02)
Random-effect 95% prediction interval           (-0.70, 0.42)

Horizontal axis: Nonevent NRI (95% CI)



Figure 5. Study-specific estimates of nonevent Net Reclassification Index (NRI) upon application of model 2 versus model 1 and overall estimates
obtained using a 1-stage approach and by meta-analysis using 3 alternative weighting schemes in the Emerging Risk Factors Collaboration. Model
1 included conventional risk factors: age, smoking status, systolic blood pressure, history of diabetes, total cholesterol, and high-density lipoprotein
cholesterol, and results are stratified by sex. Model 2 additionally included log C-reactive protein. The 3 weighting schemes illustrated are 1) number
of contributing events occurring before 10 years (Event 10), 2) inverse-variance weights assuming fixed effects (IV-FE), and 3) inverse-variance
weights assuming random effects (IV-RE). There was no reclassification observed among nonevents in the Hoorn Study (shown at the bottom), and
therefore it does not contribute to the inverse-variance-weighted pooled estimates due to undefined weight. Bars, 95% confidence intervals (CIs).
CHD, coronary heart disease; NA, not applicable; SE, standard error; WT, weight. Definitions of study names are given in Web Table 1.

Figure 6. Study-specific estimates of event Net Reclassification Index (NRI) upon application of model 2 versus model 1 and overall estimates obtained using a 1-stage approach and by
meta-analysis using 3 alternative weighting schemes in the Emerging Risk Factors Collaboration. See the legend of Figure 5 for explanations. There was no reclassification observed
among events in the MONICA Göteborg Study (shown at the bottom), and therefore it does not contribute to the inverse-variance-weighted pooled estimates due to undefined weight. Bars,
95% confidence intervals (CIs). CHD, coronary heart disease; NA, not applicable; SE, standard error; WT, weight. Definitions of study names are given in Web Table 1.

The prospective NRI. The prospective NRI (37) allows
inclusion of participants censored before t years and requires
estimating (using Kaplan-Meier methods) the probability of
having an event among 1) all participants, 2) those who move
up through the risk categories, and 3) those who move down
through the risk categories. Its standard error can be obtained
by bootstrapping. The prospective NRI is generally more stable
within studies (since censored participants are included,
increasing numbers), and its calculation lends itself better
to within-study calculation and the 2-stage approach using
equations 2 and 3.
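
The three Kaplan-Meier quantities listed above can be combined as in the following minimal sketch. It assumes the commonly cited formulation, in which the event NRI is [P(event | up) P(up) - P(event | down) P(down)] / P(event) and the nonevent NRI is [(1 - P(event | down)) P(down) - (1 - P(event | up)) P(up)] / (1 - P(event)); whether this matches reference 37 term for term is an assumption. The function names and the toy data are hypothetical, and the bootstrap standard error is omitted.

```python
import numpy as np


def km_event_prob(time, event, horizon):
    """Kaplan-Meier estimate of the probability of an event by `horizon`."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    surv = 1.0
    # Step through the distinct event times up to the horizon.
    for t in np.unique(time[event & (time <= horizon)]):
        at_risk = np.sum(time >= t)
        failures = np.sum(event & (time == t))
        surv *= 1.0 - failures / at_risk
    return 1.0 - surv


def prospective_nri(time, event, cat_old, cat_new, horizon):
    """Return (event NRI, nonevent NRI) using Kaplan-Meier probabilities."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    cat_old, cat_new = np.asarray(cat_old), np.asarray(cat_new)
    up, down = cat_new > cat_old, cat_new < cat_old

    p_all = km_event_prob(time, event, horizon)             # 1) all participants
    p_up = km_event_prob(time[up], event[up], horizon)      # 2) those moving up
    p_down = km_event_prob(time[down], event[down], horizon)  # 3) those moving down
    f_up, f_down = up.mean(), down.mean()                   # proportions moving

    nri_event = (p_up * f_up - p_down * f_down) / p_all
    nri_nonevent = ((1 - p_down) * f_down - (1 - p_up) * f_up) / (1 - p_all)
    return nri_event, nri_nonevent


# Hypothetical toy data with no censoring before the 10-year horizon.
time = [2, 4, 6, 8, 10, 10, 10, 10, 10, 10]
event = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
cat_old = [0, 1, 1, 2, 0, 1, 1, 1, 2, 0]
cat_new = [1, 1, 2, 1, 0, 0, 1, 2, 1, 0]
nri_event, nri_nonevent = prospective_nri(time, event, cat_old, cat_new, horizon=10)
```

With no censoring before the horizon, as in the toy data, the result coincides with the usual count-based NRI: (2 - 1)/4 among events and (2 - 1)/6 among nonevents.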

DISCUSSION 

In this paper, we have demonstrated methods for combining 
measures of predictive ability across multiple studies. Relevant 
Stata code (StataCorp LP, College Station, Texas) is available 
from the University of Cambridge
(http://www.phpc.cam.ac.uk/ceu/research/erfc/stata). We have described assessment of
the predictive ability of a risk prediction model, the change 
in predictive ability upon moving from one model to another, 
and the comparison of predictive abilities across subgroups of 
the population. Various approaches to weighting estimates of 
predictive ability when pooling across studies have been dis- 
cussed, and we recommend using the number of events in 
each study. Study designs other than the prospective cohort 
study and risk prediction models other than the Cox propor- 
tional hazards model with duration of time in the study as 
the time scale have also been considered. The clinical implica- 
tions of using CRP concentration as an additional predictor of 
CHD risk are considered in detail elsewhere (17). 

We used the C index (22), the D measure (24), and mea- 
sures of reclassification (36) to illustrate the additional pre- 
dictive ability of CRP in prediction of 10-year CHD risk. 
For these data, the C index and D measure led to similar con- 
clusions with similar statistical power, whereas reclassification 
measures were not comparable because of their different inter- 
pretations, use of study-specific absolute risks instead of linear 
predictors, use of risk cutoffs, and exclusion of censored obser- 
vations. Calculation of the D measure and its standard error re- 
quired the least computational time. 

Study-specific estimates of discrimination were dependent 
on the risk predictors' distributions. The implication is that 
any pooled estimate of discrimination represents a value ap- 
plicable to a population with "average" risk predictor ranges. 
Caution should be applied when comparing estimates of dis- 
crimination across studies with large differences in the risk 
predictors' distributions. In our examples, the changes in dis- 
crimination were less heterogeneous and therefore more reli- 
ably combined across studies. 
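
As a rough illustration of pooling changes in discrimination, the sketch below computes Harrell's C index within a study for two models and then averages study-specific changes using the number of events as weights. The data, the second study's values, and the function names are hypothetical; tied event times are not handled, and this is not the authors' Stata implementation.

```python
def harrell_c(time, event, risk):
    """Harrell's C: proportion of usable pairs ranked correctly by predicted risk.

    A pair is usable when the participant with the shorter time had an event.
    Ties in predicted risk count as one half; tied times are ignored here.
    """
    concordant, usable = 0.0, 0
    n = len(time)
    for i in range(n):
        if not event[i]:
            continue
        for j in range(n):
            if time[i] < time[j]:
                usable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / usable


def pool_by_events(estimates, events):
    """Weighted average of study-specific estimates, weights = number of events."""
    return sum(e * w for e, w in zip(estimates, events)) / sum(events)


# Hypothetical study: follow-up times, event indicators, and predicted risks
# from a model without (model 1) and with (model 2) an added predictor.
time = [1, 2, 3, 4]
event = [1, 1, 0, 1]
risk_model1 = [0.9, 0.5, 0.6, 0.1]
risk_model2 = [0.9, 0.7, 0.6, 0.1]

c1 = harrell_c(time, event, risk_model1)
c2 = harrell_c(time, event, risk_model2)

# Pool this study's change in C (3 events) with a second hypothetical study
# that showed no change (1 event).
pooled_change = pool_by_events([c2 - c1, 0.0], [3, 1])
```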

Other approaches with which to assess prediction models 
exist. These include measures of explained variation, which 
quantify the proportion of variation in the outcome that can 
be explained by the predictors in the model. Few such mea- 
sures appear to adequately deal with censored data, and these 
approaches have proven difficult to adapt to the multistudy 
context (4). Others, such as the R²_D extension to the D measure
(38), could be applied. Measures of calibration generally
compare observed and predicted risks within groups (e.g.,
deciles) and quantify any evidence for lack of model fit
with a P value (39, 40). Calibration is important for the
assessment of a new proposed model for a target population,
but it is not the main issue when comparing the predictive 
ability of alternative models. We have also not considered 
validation approaches (41). Internal validation has not been 
necessary, because the overall data sets used are of substantial 
size and overfitting is minimal (42). External validation is 
more relevant when a new risk score is proposed and its gen- 
eralizability is of interest (1). Addressing overfitting is more 
important when combining measures of discrimination using 
inverse-variance weights from a random-effects meta- 
analysis in order to estimate predictive ability in a new study. 

Certain limitations of our proposed methods remain. 
Firstly, our approach does not combine estimates of predic- 
tive ability from nested case-control studies with those from 
cohort studies. Estimates from studies with a case-cohort de- 
sign, however, may be more comparable with those from full 
cohorts (34). Secondly, persons with missing predictors are 
often excluded. Multiple imputation methods (43) applicable 
to multiple studies need further investigation. 

As the scientific benefits of meta-analysis of individual
participant data become increasingly recognized, a
growing number of collaborative consortia are being established.
The methods presented in this paper provide practical
solutions for assessment of the overall predictive ability of risk
models, as well as the added value of novel predictors, in
such collaborative enterprises.